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ABSTRACT 

In the era of rapidly increasing amounts of time series data, classification of variable 
objects has become the main objective of time-domain astronomy. Classification of 
irregularly sampled time series is particularly difficult because the data cannot be 
represented naturally as a vector which can be directly fed into a classifier. In the 
literature, various statistical features serve as vector representations. 

In this work, we represent time series by a density model. The density model 
captures all the information available, including measurement errors. Hence, we view 
this model as a generalisation to the static features which directly can be derived, e.g., 
as moments from the density. Similarity between each pair of time series is quantified 
by the distance between their respective models. Classification is performed on the 
obtained distance matrix. 

In the numerical experiments, we use data from the OGLE and ASAS surveys and 
demonstrate that the proposed representation performs up to par with the best cur¬ 
rently used feature-based approaches. The density representation preserves all static 
information present in the observational data, in contrast to a less complete description 
by features. The density representation is an upper boundary in terms of information 
made available to the classifier. Consequently, the predictive power of the proposed 
classification depends on the choice of similarity measure and classifier, only. Due to 
its principled nature, we advocate that this new approach of representing time series 
has potential in tasks beyond classification, e.g., unsupervised learning. 

Key words: techniques: photometric - astronomical data bases: miscellaneous - 
methods: data analysis - methods: statistical. 


1 INTRODUCTION 


The variation of the brightness of an astronomical object 
over time (hereafter called light curve or time series) is an 
important way to obtain knowledge and constraint proper¬ 
ties of the observed source. With the advent of large sky 
surveys such as the Large Synoptical Sky survey (LSST, 
Ivezi et al. 2011) the incoming data stream will be so im¬ 


mense that the applied methodology has to be reliable and 
fast at the same time. While the origin of variability can 
be very different, a huge fraction of the variable objects in 
the sky has a stellar origin. From those variable stars many 
show (quasi-) periodic behaviour and originate from the in¬ 
stability stripe in the Hertzsprung-Russell-diagram or are 
multi-star systems where the origin of the variability is the 
mutual occultation. The main focus of this work will be on 
periodic sources, but in principle the presented methodology 


can also be used for non-periodic sources (see e.g. Donalek 
et al.|2013 1. 


The classification performance of periodic sources is al¬ 
ready fairly high provided that the period and the amplitude 
of the variation are determined correctly ( |Bailey belaud] 
1899 Bailey 1902 Bono et al. 19971. But apart from the 


very soft boundaries between the classes, the quality of the 
period-finding algorithm depends on the type of variability 
ifself (Graham et al. |2013j ) and thus a dependency between 
those two properties is encountered. In order to break this 
dependency one can either rely on only (quasi-) static fea- 
turctQfor the classification or estimate the period and derive 
classifications by analysing the phase-folded light curves (see 
e.g. |Debosscher et al.|2007 and references herein). |RichardF 


|et al.| (2011) showed that the inclusion of static features 
yields an improved classification performance and that the 
contribution of the static and non-static features to the ac¬ 
curacy is of the same order. 

In this work, we introduce a novel representation of time 


1 Throughout this paper we will divide features derived by other 
authors in three categories: non-static: everything directly re¬ 
lated to period finding and features derived from the periodogram. 
quasi-static: features that treat the data as function instead in¬ 
stead of a time series, e.g. slope, linear trend, static: all features 
that treat the measured fluxes only as an ensemble and thus the 
temporal information is discarded, e.g. median, standard devia¬ 
tion. A complete list of features used here is given in Table [I] 
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Figure 1. Schematic view of all steps from the raw data to the 
classification method. 


series that aims to replace the static features. We represent 
each noisy data point by a Gaussian; the mean of the respec¬ 
tive Gaussian is the measurement and the standard devia¬ 
tion is given by the measurement uncertainty (photometric 
error). Hence, every time series is represented by a mixture 
of Gaussians that conserves all static information available 
in the data. We advocate that this a simple and natural 
choice. In contrast to that, features can be seen as deriva¬ 
tives (such as moments) of this density model and therefore 
only describe certain properties of it. For instance, |Lindsay] 


& Basak (2000) show that moments are just able to describe 


the tails of a distribution but do not necessarily give a good 
description of the underlying distribution. 

As a consequence, the proposed density-based repre¬ 
sentation presents an upper boundary to the static informa¬ 
tion content which can be made available to the classifier. 
The similarity of two densities is thereby judged using three 
widely used distance measures, the L2-norm, the Kullback- 
Leibler-divergence and the Bhattacharyya distance. These 
measures of similarity are then fed into two different classi¬ 
fiers. Finally, we compared the classification performance of 
the density- and feature-based approaches. 

The aim of this work is to introduce an alternative and 
more general notion of similarity between light curves, which 
correctly takes into account measurement uncertainty. In the 
new representation all static information contained in the 
observations are conserved in a more principled way and ad¬ 
jacently fed to the classifier. Consequently, we expect that 
this new representation provides a reference in terms of clas¬ 
sification performance. 

In Section [2] the new representation and its respective 
application to the classifier are described. After describing 
the used data in Section [ 3 ] the results of two different ex¬ 


periments are presented in Section 
discussion of our approach in Section 


nfflj 

ion 5] 


We conclude with a 


2 METHOD 

In this section the methodology is described. A sketch of 
the entire classification process is shown in Figure [T] Each 
step is annotated with the respective subsection in the text; 
the FCLC software which includes all steps described in the 
following is available at http://ascl.net/1505.014 


2.1 Converting data points into densities 

The key idea of our method is to convert the individual 
data points with their errors as a continuous density. We 
treat each data point as a normal distribution with a mean 
p equal to the magnitude y and a width a equal to the 
photometric error Ay of the respective measurement. This 
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Figure 2. The principle of the conversion to densities, every point 
is described by a normal distribution which are then added up to 
a PDF. 


allows us to convert the discrete M number of observations 
into a continuous density by using: 

1 M 

PDF (x) = — ^A/ - (x | y t , A yi ) (1) 

i =1 

where N (p, a) is the normal distribution with expectation p 
and width a, which returns the probability of the occurrence 
for a given value x. Each light curve is, after subtracting 
the median, converted to such a probability density function 
(PDF)-, a visualisation of this process is shown in Figure [2] 
This idea was already mentioned in the work of Aherne et al.l 
(19981. 


2.2 Parsimonious mixtures of Gaussians 

An important step to make the computation of the distances 
computationally feasible is to reduce the number of Gaus¬ 
sians in the mixture of Gaussians (MoG). The look-up func¬ 
tion for individual values of x scales linearly with the number 
of Gaussians in the mixture. The computation of the dis¬ 
tance between every two densities scales directly with the 
number of Gaussians m in each density. Thus the compu- 
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tational complexity for computing the distance matrix is of 
the order of O (n 2 m ), where n is the total number of light 
curves to be classified. This computation gains a significant 
speedup by reducing the number of Gaussians; reducing the 
number of observations (typically M = 300) to a mixture of 
Gaussians with m — 20 components yields an effective gain 
in speed of (^y) « 15. 

We tested several ways described in the literature to 


reduce the number of Gaussians effectively (Crouse et al. 


2011). After experimenting with the different methods we 


found that the method by Runnalls (20061 yielded the most 
satisfactory results. The basic idea is that two similar (in 
terms of the Kullback-Leibler-Divergence, see below) Gaus¬ 
sians can be approximated by a single normal distribution. 
The dissimilarity between two normal distributions with am¬ 
plitudes Wo;Wi means ^o;Mi and widths ao;ai is thereby 
measured by 


D = 0.5culog ( Co o — ir- t- + £>i 
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with u> = ao + cri; uio = —; Cb\ = yy. The pair of normal 
distributions with the closest distance D is then merged into 
a new single Gaussian with weight Woi = Wo + Wi, expec¬ 
tation fj ,oi = + iv^Mi and variance ah = + 

^w~ a i + 1 (mo — Mi) 2 - The search and replacement is 

then performed iteratively until the desired number of new 
components is reached. An example of a reduced MoG is 
shown in the bottom plot in Figure [2] Apart from the de¬ 
creased computational complexity the reduction in number 
of components used in the MoG has yet another very in¬ 
teresting side effect. Due to the loss of information the new 
PDF is always just a smoothed version of the density-based 
on the real data. As the data are irregularly sampled this 
smoothing is effectively a better representation of the true 
underlying density. Obviously, the number of Gaussians to 
be used is a parameter which has to be optimised. Here, it 
will be optimised by maximizing the classification accuracy 
for a given dataset and classifier. 

Another aspect to mention is the conservation of out¬ 
liers. Since iteratively only the most similar Gaussians are 
merged into a single one, the presence and probability of 
outliers will remain unchanged throughout this procedure. 


2.3 Similarity of probability densities 

After converting all light curves to PDF, we apply different 
measures of similarity between two given probability densi¬ 
ties P(x), Q(x). As light curves differ in apparent magnitude 
we subtract the median magnitude in order to align the den¬ 
sities of different objects. 


2.3.1 L2-norm 

The most obvious choice for comparing two densities is the 
L2-norm, defined as 

L2{P{x),Q{x)) = J (P(x)-Q(x)) 2 dx . (3) 

While the L2-norm is a very robust and reliable measure of 
the similarity, it is not very sensitive to faint tails as dif¬ 
ferences in the main component are penalized more heavily. 


But, as stated in the introduction, the tails contain the vast 
majority of information of a density. Hence, we do not ex¬ 
pect the L2-norm to be a good distance measure for our 
classification problem. 

2.3.2 Bhattacharyya distance 

The Bhattacharyya distance ( BHA ), defined as, 

BHA(P(x),Q(x)) = — log J \JP(x)Q(x)dx (4) 

is a generalisation of the Mahalanobis distance which, in 
contrast to the latter one, takes into account the difference 
in shape. The Bhattacharyya distance has been used in clas¬ 
sification problems before (see e.g.,[Ah erne et aT| l 998| and 
thus seems a very good choice for our method. 

2.3.3 Symmetrised Fullback-Leibler divergence 
The Kullback-Leibler divergence ( KLD ), 

KLD ( P(x ), Q(a :)) = J P(x) log dx (5) 

is a measure of similarity of two probability densities in in¬ 
formation theory. It consists of two terms one being the en¬ 
tropy (information content) of P(x) and a term which is the 
expectation of log ( Q(x )) with respect to P{x). The second 
term is the log-likelihood that the observed density Q(x) was 
drawn from the model density P(x). The KLD is capable 
of describing also difference between densities in faint tails. 
The KLD itself can not be treated as a distance directly 
since - even though it returns zero for identical densities - 
it is not symmetric. We circumvent this problem by simply 
symmetrising the KLD and thus, we finally compute 

KLD ayln (P(x), Q(x)) = KLD (P, Q) + KLD (Q, P) . (6) 


2.4 Computation of distances 


The KLD and the BHA can not be computed analytically 
for two MoG and thus must be approximated by performing 
the integration. Even though, analytical approximations ex¬ 
ist for the KLD (see Durrieu et al. 2012 and references 
herein), we encountered numerous difficulties when using 
them in practice, e.g., as non-positive distances. For this 
reason, we decided to perform the integration for all dis¬ 
tances numerically. For our one-dimensional case, we found 
the following numerical integration to be sufficient 


/ OO 

F(x) dxxA{F (x 0 ) + F (x 0 + A) + ... + F (xfi) . 

-OO 

( 7 ) 

The integration above is performed from —oo to +oo. Here 
the integral is numerically approximated and therefore a fi¬ 
nite range must be defined. The lower and the upper bound¬ 
ary are chosen very generously by integrating from xo = 
fj,(i) — 5a (i), with i = argmin/r (*) to x\ = /i (i) + 5a (i), 

ieMoG 

with i = argmax/r (j). In order to retain the same preci- 

i£MoG 

sion for all integrals we chose the integration width A for 
all integrations to be the same. To be on the safe side, we 
set A = 0.001 but when experimenting with this width it 
turned out that A = 0.005 is sufficiently small to minimise 
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Figure 3. The effect of an outlier added to the distribution in 
Figure [2] is shown. The x-axis gives the magnitude of the in¬ 
jected point with respect to the median, the left y-axis are the 
KLD/BHA distance in the upper plot and the L2/BHA dis¬ 
tance in the lower one. The right y-axis denotes the ratio of the 
respective distance measures. 


Feature 

Moment 

Data 

Feature 

Model 

Feature 

Amplitude 

A = 0.5 • £0.95,0.05 

0.150 

0.141 

Beyond1St d 

1 - f ai+,T2 P(x)dx 

0.446 

0.435 

FPRMid20* 

£0.60,0.40 / £0.95,0.05 

0.325 

0.309 

FPRMid35* 

£0.675,0.325/3:0.95,0.05 

0.479 

0.468 

FPRMid50* 

£0.75.0.25/£0.95,0.05 

0.625 

0.631 

FPRMid65* 

£0.825,0.175/ £0.95,0.05 

0.804 

0.789 

FPRMid80* 

£0.90,0.10/£0.95,0.05 

0.904 

0.899 

Skew 

73/02 

-0.185 

-0.182 

SmallKurtosis 

< 74 / 0-2 — 3 

-1.365 

-1.361 

MAD 

xmad 

0.083 

0.083 

MedianBRP 

r°.5 + ^/|P(£)d£ 

Jaio. 5 —A/5 v > 

0.142 

0.126 

Percent Ampl. 1 

incl. median of LC 

0.013 

- 

PDFPt 

incl. median of LC 

0.021 

- 

StetsonK^ 

incl. # observations 

0.894 

- 


*FPR: FluxPercentileRatio, x f g — x f — x g 
t Features without equivalent model description 

Table 1 . Computed features from observations and from the den¬ 
sity model with respective formulas of an example light curve. 


computation time (scales with A -2 ) without any loss in ac¬ 
curacy. Obviously, a good estimate for A is given by taking 
a fraction of the typical standard deviation in the mixture 
of Gaussians, as then the integration resolution is below the 
typical scale width of the density. To prevent the integration 
from encountering ill defined (that is, negative values in the 
square root or log) values we add a small constant to each 
of the densities which does not yield any measurable impact 
on the final classification. 

In Figure[3]the impact of the injection of a single outlier 
on the L2, BHA and I\LD with respect to its injection posi¬ 
tion (x-axis) is shown. Therefore a single measurement value 
with a typical photometric error is inserted into the distri¬ 
bution from Figure [2] and the distance to the undistorted 
distribution is computed, respectively. It is evident, that the 
KLD reacts way more heavily to a single outlier, which will 
eventually also limits its use for classification task, as shown 
in the results. Note, that in principle the KLD distance 
would diverge to infinity; the plateau is just encountered 
due to the added small constant, mentioned above. 


2.5 Relation to features 

Our density representation directly relates to the feature^ 
used in Richards et al. (201lJ; a detailed definition of all the 
used features is given in the Appendix [A] As shown in Ta¬ 
ble [I] we can recover all but three features directly from the 
density, except for StetsonK, Percent Amplitude, PercentD- 
ifferenceFluxPercentile (PDFP). StetsonK contains the dis¬ 
crete number of observations as one of its input parame¬ 
ters, the latter two the absolute median value of the magni¬ 
tudes. No equivalent measures for those exist for our median- 
subtracted and normalised densities. 

To explain all the other features we use the common 


2 To compute the features we use the python package provided at 
http://isadoranun.github.io/tsfeat/FeaturesDocumentation.html 


also used in 


Nun et al. 


(2014). 


notion of moments of a density 


O' n 



x n P ( x ) dx 


( 8 ) 


with <7o = 1, <Ji being the mean, 02 the standard deviation 
and so on. Another frequently used integral is the percentile, 
Xf, where the density contains a certain fraction /, defined 
by 


Xf : 



dx = f. 


(9) 


Additionally, the median absolute deviation (MAD) is de¬ 
fined as 

r x MAD 

xmad : / P {x — X 0 . 5 ) + P (* 0.5 — x) dx = 0.5 (10) 

Jo 

where £ 0.5 is the median. One can see that most features 
can be expressed in terms of our PDF and thus the com¬ 
puted density contains most of the information encoded in 
the features. 


2.6 Classification 


In this subsection we are describing the functionality and 
use of the different classifiers applied in this work. The first 
two of those classifiers depend actually on a distance matrix 
which is the direct outcome of our distance measure. For 
the features the distance matrix is created by computing 
the euclidean distance 


D ( v, w) 



n= 1 


(ii) 


between two feature vectors v, w. For the interested reader, 
more details on the used classifiers can be found in IHastiel 


et al. (2009). We use the implementations provided in the 
python package scikit-learij^] To exclude effects originating 
from the preprocessing of the features, we also classified the 


3 http://scikit-learn.org/ 
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light curves with a min-max normalised version of the fea¬ 
tures. In the following, the applied classifiers k nearest neigh¬ 
bors (kNN) and the support vector machine (SVM) are ex¬ 
plained in more detail. 


2.6.1 k nearest neighbours 

Once the distances between all light curves are computed, 
we can sort the matrix for each candidate light curve and 
look at the types of the closest reference light curves. The 
only free parameter is fc, the number of neighbours chosen 
per test light curve. Another degree of freedom can be in¬ 
troduced by weighting the distances to the neighbours, e.g., 
decaying distance. In practice, we obtained no significant 
gain and thus we use a classical majority vote. If the num¬ 
ber of objects of a certain class is equal for two (or more) 
different classes, a random class out of those is assigned. 


2.6.2 Support vector machine 

A slightly better performance in classification can be reached 
if the distance matrix is used as the kernel of a support 
vector machine (SVM). In this work, we use the radial basis 
function (RBF) kernel which reads 

Ki, = exp A;,) (12) 

with D being the distance matrix and 5 the bandwidth. As a 
consequence, low distances will have a kernel value close to 
unity and distances significantly larger than 5 will be close 
to zero. We take the i'-SVM as the kernel classifier. Two 
parameters have to be tuned in a J/-SVM, namely the kernel 
width 5 and the width of the soft margin u. The soft margin 
controls the fraction of mis-classifications in the training of 
the classifier. 


2.6.3 Random forest 


Given the success in Richards et al. (2011), we use the ran¬ 


dom forest (RF) as a candidate classifier as well. RF extends 
the concept of a single decision tree by using an ensemble of 
randomised decision trees. Unfortunately, by its very nature, 
this classification method can only be used on features. At 
each node of a tree the features are split such that the infor¬ 
mation content (entropy) is maximised at each decision. The 
dominant free parameter in a RF is the number of decision 
trees, which is the only one considered in this work. 


2.7 Performance and optimisation 

Each of the classifiers presented, has several free parame¬ 
ters to be optimised but we stick for all the methods with 
the most important ones. For the kNN comparison this pa¬ 
rameter is the number of investigated k nearest objects, the 
I'-SVM classifier has the tunable softening parameter v and 
the kernel width <5 and eventually the RF can be build up of 
T number of trees. While other parameters (e.g. tree depth 
in RF) might have an impact on the classification quality, it 
is not the aim of this work to investigate this possible gain 
with the choice of these parameters. Also the process of fea¬ 
ture selection is skipped and throughout this work always all 


classifier parameter 


range 


kNN 


y-SVM 


RF 


# neighbours k 
weights 

softening parameter v 
kernel width <5 
kernel 

kernel degree 

number of trees T 
split algorithm 
max. # of features 


1, 2, ..., 30 
uniform (fixed) 

0 .01, ..., 1.00 (adaptive) 
0 .01, ..., 100 (adaptive) 
RBF (fixed) 

3 (fixed) 

100 , 200 , . .., 1000 
gini (fixed) 
all (fixed) 


Table 2. Overview over classifier parameters to be optimised 


features defined in Table [I] are used. All the parameters are 
evaluated for each classifier and data set independently on a 
fixed grid and the respective value with the highest accuracy 
is eventually chosen. A summary over the tuned parameters 
and their respective search ranges, as well as all parameters 
that have not been optimised, are shown in Table [2] 

We judge the performance of a classifier by computing 
the accuracy defined as the mean fraction of correctly clas¬ 
sified targets over a 10-fold cross-validation; the uncertainty 
in accuracy is given by the standard deviation. In addition, 
we compute the confusion matrix of the best classifiers to 
investigate possible caveats in the presence of multiple and 
unbalanced classes. 


3 DATA 

We conduct experiments with the different representations 
and classifiers on two datasets. This has the advantage that 
we have two independent measures for the predictive power 
of our method. In the first experiment, three classes are to 
be separated; in the second a more complex seven class clas¬ 
sification is performed. In fact, in the former dataset the 
classes are defined more broadly (e.g., no distinction be¬ 
tween different binary classes) and thus it is expected that 
the classification accuracy will be higher than in the latter 
case. It is the aim of this experiment to show, that our classi¬ 
fication algorithm can perform comparably well to state-of- 
the-art classifiers for very broad and detailed classification 
tasks alike. 


3.1 OGLE 


The Optical Gravitational Lensing Experiment (OGLE, 
Udalski et al][20081 is a survey originally dedicated to the 
search for microlensing events and dark matter. Therefore, 
stars of the Magellanic clouds and the galactic bulge were 
monitored for the unique traces of microlensing events. Con¬ 
sequently, millions of stars have been monitored, delivering 
a rich database of variable stars. In our work, we use the 
dataset used in Wang et al. |2lll2 p| where some RRLyrae, 
eclipsing binaries and cepheids in the Magellanic clouds were 
extracted from the OGLE-II survey. The objects selected 
were known to be periodic before and thus their period 
was known as well. In the publication, the determination 
of the period is the main goal, but the database presents a 
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Survey 

VarType 

Entities 

(#obs) 


Cepheids 

3567 

225 

OGLE 

Eclipsing binaries 

3929 

330 


RR Lyrae 

1431 

323 


MIRA 

2833 

342 


ED 

2292 

570 


RR Lyrae AB 

1345 

412 

ASAS 

EC 

2765 

524 


ESD 

893 

547 


DSCT 

566 

492 


DCEP-FU 

660 

561 


Table 3. Types of variables, number of entities and average num¬ 
ber of observations. 

good test bed for classification as well, since a correctly de¬ 
termined period favours also good classification results and 
thus the classification is very reliable. The total number of 
objects is listed in Table [3] Some of the files contain lines 
with invalid entries, that is a few lines with a measurement 
error of zero, which have been removed. 



Figure 4. Classification accuracy versus the number of Gaussian 
components using the kNN classifier on both datasets. Accuracy 
is largely insensitive to the exact choice of the number of Gaussian 
components. 


3.2 ASAS 


The All Sky Automated Survey (ASAS, Pojmanski 1997) 
is performed with telescopes located on Hawaii and in Las 
Campanas and is lead by the Warsaw university in Poland. 
The sky is observed in the I and V band with an initial limit 
of 13 mag (later extended to 14mag). In 2005 the ASAS cat¬ 
alog of variable stars (ACVS, |Pojmanski et af7||2005[ ) was 
published which is the starting point for our experiment. 
From the ACVS we extracted all objects with a unique clas¬ 
sification which is not miscellaneous. Subsequently, we re¬ 
moved all light curves having less than 50 observations and 
all classes with less than 500 members. A summary of the 
classes used can be found in Table(3] For the classification we 
used the magnitude “Mag_2” (which corresponds to a 4 pixel 
aperture) which is a reasonably good measure of the bright¬ 
ness for fairly bright and faint stars. Due to the extension 
to the faint end, the classes given could inherent some false 
classifications itself, especially since also subclasses (e.g., de¬ 
tached and contact binaries) are annotated and hence, it is 
expected, the class assignment in the given catalog is not as 
reliable as in the OGLE case. 


4 RESULTS 

As stated in the methodology, the impact of the reduction 
of the number of Gaussians has to be quantified. In Figure 
[4] we show empirically that the impact on the final accuracy 
is only marginal, as long as the number of components ex¬ 
ceeds 10. For all conducted experiments we fix the number 
of Gaussians to 20. 

In Table [4] and [5] the results of the different experiments 
for the OGLE and ASAS data set are shown, respectively. 
Since the L2-norm performs, independently of the chosen 
classifier, always worse than the BHA and KLD metrics, 
we exclude it from the discussion in the following. 

For the OGLE experiment we see that each method 
(feature and density methods) performs comparably well 
within the typical deviation between the 10 cross-validation 



kNN 

1 /-SVM 

RF 

Features 

95.09 ±0.74 

96.86 ±0.52 

95.61 ± 0.82 

(raw) 

k=8 

v =0.04, 5 =0.31 

T=500 

Features 

95.51 ± 0.81 

96.88 ±0.67 

95.59 ± 0.83 

(norm.) 

fc=10 

v =0.06, 5 =0.08 

T=500 

L2 

93.44 ± 0.88 

95.92 ±0.68 

— 

k =3 

v =0.06, (5 =0.69 

- 

KLD 

95.14 ± 0.70 
k =5 

95.51 ± 0.94 
v =0.14, S =0.33 

- 

BHA 

94.84 ± 0.83 

96.01 ± 0.71 

— 

k=7 

v =0.08, 6 =0.14 

- 


Table 4. Results for the optimal classifiers for the 3 class classifi¬ 
cation of OGLE data. The performance is the average fraction of 
correctly classified objects in a 10-fold cross-validation with the 
standard deviation of this performance being the error (all given 
in per cent). 


folds. It is worth noting, the RF, claimed to be the best 
classifier in Richards et al. ( 201 1| ) , does not perform any 
better than the other classifiers. It is further interesting to 
see that the feature-SVM is performing slightly better than 
the SVM based on the density representation. As mentioned 
in Section [2] three features exist which cannot be described 
by the density-based approach. When removing those re¬ 
spective features from the feature list, the accuracy of both 
feature-SVMs drops by one per cent, indicating that the dif¬ 
ference in accuracy does originate from those. The strength 
of the variation with respect to the median observed bright¬ 
ness appears to bear some information about the type of 
variability. We elaborate further on this issue in the discus¬ 
sion section. 

That the impact of those median-based features is any¬ 
way not too high is supported by the results of the seven 
class ASAS classification. It becomes apparent that the more 
generic definition of the density enhances the accuracy in 
contrast to all the feature-based classifiers. The confusion 
matrices of the best classifiers from the density and feature 
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kNN 

I/-SVM 

RF 

Features 

74.22 ± 1.24 

78.02 ±0.68 

79.98 ± 1.16 

(raw) 

k=11 

v =0.19, 5 =0.53 

T=400 

Features 

77.60 ± 0.76 

80.47 ± 1.21 

79.99 ± 1.55 

(norm.) 

fc=17 

v =0.17, 5 =0.10 

T=400 

L2 

79.57 ±0.80 

82.08 ±0.89 

_ 

fc=19 

v =0.01, 6 =0.56 

- 

KLD 

78.96 ± 1.87 

75.56 ± 0.94 


k =23 

v =0.26, 5 =0.34 

- 

BHA 

79.73 ± 0.83 

81.11 ± 0.90 

- 

k =29 

v =0.20, <5 =0.14 

— 


Table 5. Results for the optimal classifiers for the 7 class classifi¬ 
cation of ASAS data. The performance is the average fraction of 
correctly classified objects in a 10-fold cross-validation with the 
standard deviation of this performance being the error (all given 
in per cent). 
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Figure 5. The accuracies (given in per cent) of the feature 
based (left) and density (All/1-mel.ric) based r-SVM classifier 
are shown for the OGLE dataset. The x-axis shows the labels 
according to the classifiers, the y-axis the given ones; the colour 
scale stands for the respective accuracy; from zero (red) to hun¬ 
dred (green) per cent. 

based classification are shown in Figure [5] [6] It can be seen 
that classes with more members achieve a higher accuracy 
which is expected due to the higher number of training ob¬ 
jects. Otherwise, no significant biases in any direction be¬ 
tween the two different classification approaches can be de¬ 
tected. While the gain in accuracy is again only marginal, 
it can be shown that the same quality is only reached if the 
three features, not describable by the densities, are included. 
Else the classification rates of the feature-based SVMs drop 
again by one per cent. Apart from this, it can be observed 
that the classification quality of the density-based classifiers 
depends quite strongly on the choice of the distance metric. 
The KLD does perform in three of the four experiments 
worse than the BHA which supports the statement that 
the BHA distance is a good distance measure for classifica¬ 
tion tasks. On the other hand, one should realise that the 
choice of the metric, that is the distance between two given 
feature vectors, is in principal also a free methodological fac¬ 
tor in the classification problem. Apart from the standard 



Figure 6. The accuracies (given in per cent) of the feature based 
(left) and density (L2-metric) based 1 /-SVM classifier are shown 
for the ASAS dataset. The x-axis shows the labels according to 
the classifiers, the y-axis the given ones; the colour scale stands 
for the respective accuracy; from zero (red) to hundred (green) 
per cent. 

euclidean distance and the Mahalanobis distance no other 
measures have been investigated in the literature. 


5 DISCUSSION 

In this work, we present a generalisation of static features 
for the classification of time series. In contrast to previous 
work, we do not rely on describing static densities with a 
set of features but use the densities themselves to measure 
the similarity between two light curves. By doing so, we can 
reduce the number of degrees of freedom in the methodol¬ 
ogy from four (preprocessing — feature selection — choice 
of metric — choice of classifier) to two (choice of metric — 
choice of classifier). This allows us to skip the step of fea¬ 
ture selection. The proposed approach follows first principles 
by simply assuming a model for representing the data; once 
a metric is chosen, classification in a kernel setting follows 
naturally. The strong point of the newly proposed represen¬ 
tation is the fact that it captures all the information present 
in the data (including measurement errors) and makes it 
available to the classifier. 

As highlighted in the results, the choice of the metric 
used in the density representation plays an important role. 
A priori, we are not aware of any natural choice of a met¬ 
ric. We have shown in our experiments that the BHA and 
L2 distance are performing very well in terms of accuracy. 
In principle, other (or combinations of) metrics might exist 
that are more suited for a given classification problem. 

Our approach presents a different way of performing 
classification. Therefore, it provides independent evidence 
that the widely used features are indeed well chosen for the 
classification problems considered so far. However, it is un¬ 
clear how well the chosen features generalise to other classi¬ 
fication problems. On the other hand, the density represen¬ 
tation is formulated generically and encodes all information 
available in the data. Additionally, the proposed method 
naturally encodes also uncertainty in the measurements, 
which is not taken into account in the feature-based ap¬ 
proaches so far. As a consequence, it is now possible to learn 
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a classification on data of one survey that contains small 
(large) measurement uncertainty, and predict on data of an¬ 
other survey with large (small) photometric error. While this 
is problematic for feature-based approaches, it is automati¬ 
cally taken care of in the density representation. 

We have shown that the feature- and density-based ap¬ 
proaches perform comparably well in terms of accuracy for 
the given datasets. As aforementioned, there are three fea¬ 
tures that cannot be derived from the density representation 
which appear to increase the classification accuracy. In par¬ 
ticular, the StetsonK value depends directly on the number 
of observations in a light curve, and for this reason it cannot 
be derived from our representation. It is questionable why 
the number of observations should be a defining property 
of a class. The only reason why it contributes to the per¬ 
formance is because certain classes are apparently observed 
more often than other ones (see Table |3j, and not because it 
is an inherent physical property. In Figure[7]we show that it 
is possible to classify ASAS light curves into MIRA and de¬ 
tached binaries with a 75% accuracy solely using the number 
of observations that happened to be recorded. The bright¬ 
ness of stars that vary over a wide range of magnitudes, such 
as MIRA, will frequently drop below the survey-specific de¬ 
tection limit. Hence, faint observations will not be recorded 
in the database. This raises the following problem: absent 
recordings are ambiguous because it is not evident whether 
the source was too faint to be detected or simply not ob¬ 
served. As a consequence, the number of observations, and 
thus StetsonK, hints to the variability type of a star within 
one survey. However, this feature is survey-dependent as 
surveys differ in database structure (e.g., some give upper 
limits) and detection limits, and thus does not generalise. 
If non-detections are not treated accordingly, the definition 
and use of the StetsonK value can cause dramatic bias on 
the classification, especially when knowledge is transferred 
between different surveys, as done, e.g. in |Blomme et al.| 
(2010). Similarly, the Percent Amplitude and the PDFP di¬ 
rectly depend on the apparent magnitude of the respective 
object, which is also not an inherent property of a class. 
Conclusively, the only reason why these three features con¬ 
tribute to the accuracy is because of the presence of a (or 
several) bias in the observations and not because they cap¬ 
ture physical characteristics of the data. We do not state 
that the features in question are useless for classification 
(indeed they increase the accuracy), but argue that they do 
not generalise and are therefore not useful for knowledge 
transfer between surveys with different observational bias. 
It should be considered, to redefine these features accord¬ 
ingly, such that they do not rely on the observation strategy 
of a survey. 


In summary, the proposed method (a) introduces a more 
general notion of distance between light curves in contrast 
to static features, (b) naturally incorporates measurement 
errors, (c) performs equally well as state of the art feature- 
based classifications and (d) yields an independent measure¬ 
ment of the accuracy as compared to feature-based classifi¬ 
cation. 

As a future prospect, the density-based representation 
could be useful in unsupervised settings where the notion of 
distance is more critical in the absence of labels which are 
the driving force in a classification task. Feature sets that 
have been optimised for classification do not necessarily pro¬ 


MIRA 



Number of observations 


Figure 7. A histogram over the number of observations for the 
MIRA and detached binaries classes in the ASAS survey are 
shown. The number of observations clearly correlates with the 
class label: if a bisectional line is introduced at 427 observations, 
a classification rate of 75.1% can be reached. 


vide a good similarity measure. In subsequent work, we will 
investigate whether the proposed notion of distance natu¬ 
rally distinguishes between the different variability types. 
Additionally, we advocate that besides static features also 
temporal information should be incorporated in a similar 
vein. However, the design of such a time-dependent repre¬ 
sentation remains an open question. 
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StetsonK More robust measure of the kurtosis, as de¬ 


fined in Stetson (1996). 


APPENDIX A: DETAILED DESCRIPTION OF 
FEATURES 

In the following, we give a detailed description of the static 
features used in [Richards et al. ( |201 1). The computation 
of the software is done using the Python FATS package, 
available under https://pypi.python.org/pypi/FATS, The 
error in the definition of the StetsonK value in older versions 
was corrected manually. 

Amplitude Absolute difference between highest and 
lowest magnitude. 

BeyondlStd Fraction of photometric points that lie 
beyond one standard deviation with respect to the (with 
photometric errors) weighted mean. 

FluxPercentileRatio (FPR) Relative difference of flux 
percentiles with respect to the 95 to 5 percentile difference. 
The number after the FPR gives the width of the percentile, 
always centered on 50, e.g., FPR20 = ^pfrzrpf- 

Skew The skew of the distribution of magnitudes. 

SmallKurtosis Kurtosis of the magnitudes for small samples. 

Median absolute deviation (MAD) Median deviation 
of the absolute deviation from the median. 

Median buffer range percentage (MedianBRP) Frac¬ 
tion of data points lying within one tenth of the amplitude 
around the median. 

Percent Amplitude Largest absolute difference from the 
median magnitude, divided by the median magnitude itself. 

PercentDifferenceFluxPercentile (PDFP) The 95 to 5 
flux percentile difference, divided by the median of the flux. 










